Add ThreadSanitizer CI jobs (Linux + macOS) #148

Merged
bghgary merged 16 commits into BabylonJS:main from bghgary:feature/thread-sanitizer on Apr 17, 2026

@bghgary bghgary commented Apr 7, 2026

Add ThreadSanitizer (TSan) support for detecting data races at runtime.

[Created by Copilot on behalf of @bghgary]

Changes

  • CMakeLists.txt: New ENABLE_THREAD_SANITIZER option with -fsanitize=thread. Includes a mutual exclusion check with ENABLE_SANITIZERS since TSan and ASan cannot be combined.
  • .github/workflows/build-linux.yml: New enable-thread-sanitizer input, passed as ENABLE_THREAD_SANITIZER to CMake. Wires TSAN_OPTIONS with suppression file and sets JSC_useConcurrentGC=0 for the Run Tests step (see below).
  • .github/workflows/build-macos.yml: New enable-thread-sanitizer input; wires TSAN_OPTIONS.
  • .github/workflows/ci.yml: New Ubuntu_ThreadSanitizer_clang and macOS_Xcode164_ThreadSanitizer jobs.
  • .github/tsan_suppressions.txt: Suppresses JSC-internal data races on Ubuntu (called_from_lib:libjavascriptcoregtk). These are allocator-level races in JSC's JIT worker threads that we cannot fix.

Linux TSan: Concurrent GC Workaround

JSC on Linux uses pthread_kill(tid, SIGUSR1) + sem_wait in WTF::Thread::suspend() to stop mutator threads at GC safepoints. TSan's signal interception defers SIGUSR1 delivery indefinitely when the target is inside instrumented code; the handler never runs and the Collector Thread's sem_wait deadlocks. macOS JSC uses Mach thread_suspend() (no Unix signals) and is unaffected.

Setting JSC_useConcurrentGC=0 on the Linux TSan job removes the dedicated Collector Thread; GC runs on the mutator without cross-thread signals. Reproduced locally in WSL Ubuntu: default configuration hung 9/20 runs; with JSC_useConcurrentGC=0, 0/30 runs hung.
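
The handshake described above can be sketched as follows. This is a simplified illustration assuming Linux with POSIX signals and unnamed semaphores, not JSC's actual code: `runHandshake`, `onSuspendSignal`, and the busy-looping mutator are hypothetical names, and a real suspend protocol would also keep the mutator parked until it is resumed. The point is only that the collector's `sem_wait` can return solely when the mutator's signal handler runs and posts.

```cpp
#include <atomic>
#include <cerrno>
#include <csignal>
#include <pthread.h>
#include <semaphore.h>

namespace {
sem_t g_ack;                      // posted by the mutator's signal handler
std::atomic<bool> g_stop{false};  // tells the mutator loop to exit
std::atomic<int> g_handled{0};    // number of SIGUSR1 deliveries observed

// Runs on the mutator thread; sem_post is async-signal-safe.
void onSuspendSignal(int) {
    g_handled.fetch_add(1, std::memory_order_relaxed);
    sem_post(&g_ack);
}

// Hypothetical mutator: spins until asked to stop.
void* mutator(void*) {
    while (!g_stop.load(std::memory_order_acquire)) {
    }
    return nullptr;
}
} // namespace

// Plays the collector role: signal the mutator, then block until it acknowledges.
// Returns the number of acknowledged signals, or -1 on setup failure.
int runHandshake(int rounds) {
    if (sem_init(&g_ack, 0, 0) != 0) return -1;
    g_stop.store(false);
    g_handled.store(0);

    struct sigaction sa {};
    sa.sa_handler = onSuspendSignal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGUSR1, &sa, nullptr) != 0) return -1;

    pthread_t thread;
    if (pthread_create(&thread, nullptr, mutator, nullptr) != 0) return -1;

    for (int i = 0; i < rounds; ++i) {
        pthread_kill(thread, SIGUSR1);
        // If the handler never runs, this wait never returns: that is the hang.
        while (sem_wait(&g_ack) == -1 && errno == EINTR) {
        }
    }

    g_stop.store(true, std::memory_order_release);
    pthread_join(thread, nullptr);
    sem_destroy(&g_ack);
    return g_handled.load();
}
```

Calling `runHandshake(5)` in an uninstrumented build completes once all five signals have been acknowledged. Disabling concurrent GC sidesteps the problem by removing the cross-thread wait entirely.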

Dependency Updates

Platform Support

TSan is supported on Linux and macOS with Clang/GCC. MSVC does not support TSan, and Clang targeting Windows has no TSan runtime library.

CI Impact

The TSan jobs run in parallel with other jobs and do not increase overall pipeline time.

Status

  • macOS TSan: passes clean
  • Ubuntu TSan: JSC-internal races suppressed via called_from_lib:libjavascriptcoregtk; concurrent GC disabled to avoid TSan/signal deadlock

@bghgary bghgary marked this pull request as ready for review April 8, 2026 16:10
Copilot AI review requested due to automatic review settings April 8, 2026 16:10

Copilot AI left a comment

Pull request overview

Adds ThreadSanitizer (TSan) support to the build and CI to detect data races at runtime, integrating it alongside existing sanitizer infrastructure.

Changes:

  • Added ENABLE_THREAD_SANITIZER CMake option and wiring to apply -fsanitize=thread compile/link flags (with mutual exclusion vs ENABLE_SANITIZERS).
  • Extended the Linux CI template to accept an enableThreadSanitizer parameter and pass it through to CMake.
  • Added a new Azure Pipelines Linux/Clang TSan job.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| CMakeLists.txt | Adds a dedicated CMake toggle for ThreadSanitizer and applies the required toolchain flags. |
| .gitignore | Ignores the UnitTests dist/ output directory. |
| .github/jobs/linux.yml | Adds a pipeline parameter/variable to enable TSan and passes it into the CMake configure step. |
| .github/azure-pipelines.yml | Introduces a new Ubuntu/Clang ThreadSanitizer CI job using the Linux job template. |

Comment thread .github/azure-pipelines.yml Outdated
@bghgary bghgary changed the title Add ThreadSanitizer CI job Add ThreadSanitizer CI jobs (Linux + macOS) Apr 8, 2026
@bghgary bghgary marked this pull request as draft April 8, 2026 16:17
bghgary and others added 5 commits April 16, 2026 11:45
Add ENABLE_THREAD_SANITIZER CMake option for detecting data races at
runtime. TSan cannot be combined with ASan, so it's a separate option
with a mutual exclusion check.

Adds an Ubuntu_ThreadSanitizer_clang CI job that runs the full test
suite under TSan.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
MSVC does not support ThreadSanitizer. Added macOS CI job and
enableThreadSanitizer parameter to macos.yml.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ting

arcana.cpp: bghgary/arcana.cpp fix/tsan-callback-race
UrlLib: bghgary/UrlLib fix/tsan-websocket-race

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add tsan_suppressions.txt to suppress JavaScriptCore/WTF/bmalloc
  internal data races on Ubuntu (not fixable by us)
- Wire suppressions into linux.yml for TSan jobs via TSAN_OPTIONS
- Point arcana.cpp at upstream (microsoft/arcana.cpp#61 merged)
- Update UrlLib fork SHA (post-rebase, pending BabylonJS/UrlLib#27)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary force-pushed the feature/thread-sanitizer branch from afa00dc to 2df3c81 Compare April 16, 2026 18:47
bghgary and others added 7 commits April 16, 2026 12:35
The JSC races manifest in TSan's allocator interceptors (free/malloc/memcpy)
called from JSC's JIT worker threads. Function-name suppressions (race:WTF::)
don't match because the top frame is a libc function. Use called_from_lib
to suppress any race with libjavascriptcoregtk in the call stack.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TSan instrumentation slows tests significantly. The 15-minute default
caused the Ubuntu_TSan job to be canceled mid-test. Tests were passing
but ran out of time.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Inline if/else/endif directives aren't allowed when embedded in a
string value. Use block-level conditional insertion instead.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- race:free, race:close, race:memcpy for TSan interceptors where the
  top frame is attributed to UnitTests, not libjavascriptcoregtk
- signal:libjavascriptcoregtk for JSC signal handler errno issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread .github/jobs/linux.yml Outdated
Switches UrlLib source from bghgary/UrlLib fork back to BabylonJS/UrlLib
now that the Apple WebSocket data race fixes are merged upstream
(BabylonJS/UrlLib#27).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bghgary added a commit that referenced this pull request Apr 16, 2026
…SED (#160)

[Created by Copilot on behalf of @bghgary]

## Problem

The WebSocket polyfill throws JavaScript errors in two places where the
[WHATWG WebSocket specification](https://websockets.spec.whatwg.org/)
requires different behavior:

### `close()` throws when already closing/closed

`WebSocket::Close` throws `Error: Close has already been called.` when
invoked on a socket whose `readyState` is already `CLOSING` or `CLOSED`.
Per the spec, `close()` in those states must be a silent no-op.

### `send()` throws when closing/closed

`WebSocket::Send` throws `Error: Websocket readyState is not open.` on
any non-OPEN state. Per the spec, `send()` should only throw
`InvalidStateError` when `readyState` is `CONNECTING`. When `CLOSING` or
`CLOSED`, the data is silently discarded (the spec also bumps
`bufferedAmount`, which this polyfill does not track).

Because the throws are synchronous from a JS call, a delayed or racing
call from an async callback surfaces as an **uncaught** exception that
terminates the JS runtime.

## Repro path

Observed on the macOS_Xcode164_Sanitizers leg of #148. Sequence (from
the existing multi-WebSocket test in
`Tests/UnitTests/Scripts/tests.ts`):

1. Test calls `ws.close()` on an open socket.
2. The native side transitions `readyState` -> `Closing` and issues the
close.
3. Before the `onclose` callback marshals to the JS thread, another path
(a pending `onmessage` handler, or a `catch` block on a failed `send`)
calls `ws.close()` again.
4. The second `close()` finds `readyState == Closing` and **throws**,
producing:

```
[Uncaught Error] close@[native code]
@app:///Scripts/tests.js:27799:22
```

The `send()` path is symmetric: a send scheduled between `close()` being
called and the `onclose` callback firing would throw and escape
uncaught.

## Fix

### `close()`

Replace the `throw` with an early `return`. No other state is touched,
so the first `close()` still drives the transition to `CLOSED` via the
normal callback path.

### `send()`

Split the single non-OPEN throw into two cases: throw on `CONNECTING`,
return silently on `CLOSING`/`CLOSED`.
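
A minimal sketch of the state handling described above, not the actual `WebSocket.cpp` code. The `WebSocketSketch` class, the `ReadyState` enum, and the use of `std::logic_error` as a stand-in for `InvalidStateError` are all hypothetical and only illustrate the two branches:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the polyfill's ready states.
enum class ReadyState { Connecting, Open, Closing, Closed };

class WebSocketSketch {
public:
    explicit WebSocketSketch(ReadyState initial) : m_state{initial} {}

    // close(): a silent no-op once the socket is CLOSING or CLOSED.
    void close() {
        if (m_state == ReadyState::Closing || m_state == ReadyState::Closed) {
            return;
        }
        // Only the first close() issues the close and starts the transition.
        m_state = ReadyState::Closing;
        ++m_closeRequests;
    }

    // send(): throws only while CONNECTING; discards data when CLOSING/CLOSED.
    void send(const std::string& data) {
        if (m_state == ReadyState::Connecting) {
            throw std::logic_error{"InvalidStateError"};
        }
        if (m_state != ReadyState::Open) {
            return; // silently dropped; bufferedAmount is not tracked here
        }
        m_sent.push_back(data);
    }

    ReadyState state() const { return m_state; }
    std::size_t closeRequests() const { return m_closeRequests; }
    std::size_t sentCount() const { return m_sent.size(); }

private:
    ReadyState m_state;
    std::size_t m_closeRequests{0};
    std::vector<std::string> m_sent;
};
```

With this shape, a second `close()` or a late `send()` racing with the `onclose` callback returns quietly instead of raising an exception that nothing catches.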

## Scope

- Pure behavioral fix in the WebSocket polyfill. No changes outside
`Polyfills/WebSocket/Source/WebSocket.cpp`.
- No test changes — existing WebSocket tests already exercise these
paths and will stop intermittently failing once this lands.
- `bufferedAmount` tracking during the CLOSING/CLOSED send path is still
not implemented; out of scope for this fix.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary enabled auto-merge (squash) April 16, 2026 23:30
@bghgary bghgary disabled auto-merge April 16, 2026 23:57
@bghgary bghgary marked this pull request as draft April 16, 2026 23:57

bghgary commented Apr 16, 2026

[Posted by Copilot on behalf of @bghgary]

Converting to draft while we investigate a Linux-specific hang in the Ubuntu_ThreadSanitizer_clang job.

Symptom

The Run Tests step hangs silently and the job is killed at the 30-min timeout. Recent history of the job on this PR:

| Commit | Duration | Outcome |
| --- | --- | --- |
| 2df3c81 | 1.9 min | failure (TSan race) |
| 1201a3c | 1.8 min | failure (TSan race) |
| f1d4dd1 | 1.9 min | failure (TSan race) |
| e895125 | 14.8 min | cancelled |
| 7a963f5 | 18.4 min | cancelled |
| d28ca80 | 30.3 min | cancelled (timeout) |

When TSan trips on a race early, the run fails in ~2 min. When it doesn't, the tests hang. As the race fixes land, the hang is no longer masked.

Step timing on d28ca80

| Step | Duration |
| --- | --- |
| Install packages | 18s |
| Configure CMake | 23s |
| Build Solution | 60s |
| Run Tests | 28+ min with no output → cancelled |

Build + configure are ~100s. The whole job cost is in the hung test run.

Last log lines

[log]     ✅ should load URL as array buffer
##[error]The Operation will be canceled.

No output for ~28 minutes after the last XMLHTTPRequest test. The next test in the suite is should load a PLY file and parse vertex count from header using TextDecoder (loads app:///Assets/Halo_Believe.ply via XHR, arraybuffer, then TextDecoder on the first 10 KB). Either that test hangs or something between tests does.

Why this is Linux-specific

On the same commit:

  • Ubuntu_ThreadSanitizer_clang: 28+ min hang
  • macOS_Xcode164_ThreadSanitizer: 2.7 min ✅
  • Ubuntu_Sanitizers_clang (ASan/UBSan): 2.4 min ✅
  • Ubuntu_clang (no sanitizer): 1.8 min ✅

Same tests, same TSan. So it is specific to TSan + Linux, i.e. TSan instrumentation layered on top of libcurl and/or libjavascriptcoregtk-4.1 on glibc. macOS uses NSURLSession + macOS JSC; both pass.

Hypotheses

  1. Deadlock in libcurl's threaded resolver or share-interface locks under TSan's pthread interposition (only manifests on glibc+TSan).
  2. Deadlock inside libjavascriptcoregtk-4.1 (2.50.4 on Noble) at a JS↔native boundary that TSan's instrumentation serializes differently.
  3. Test-specific: the PLY asset is ~700 KB; a large TextDecoder decode path under TSan may interact badly with a JSC GC point while a UrlLib callback still holds a lock.

Next steps

  • Add a timeout + gdb -batch -ex 'thread apply all bt' wrapper around ./UnitTests in linux.yml so the next hang produces a stack trace for every thread instead of nothing.
  • Try reproducing locally in WSL Ubuntu 24.04 to get a deterministic stack trace without burning CI minutes.
  • Once we have the stack, decide whether to fix in UrlLib, add a JSC-path suppression, skip the offending test under Linux+TSan, or split the Linux TSan job out of the per-PR pipeline.

Keeping per-PR TSan is still worthwhile — this initiative has already surfaced multiple real races. The bottleneck here is a hang to fix, not TSan overhead: the full instrumented pipeline completes in under 3 min on macOS.

The Ubuntu ThreadSanitizer job was hitting the 30-minute timeout in
~45-75% of runs with a silent hang (no TSan report, no further output).
Reproduced locally in WSL Ubuntu 24.04 with the exact same packages as CI.

Root cause
----------
JSC's concurrent garbage collector on Linux suspends each mutator thread
at GC safepoints using a SIGUSR1-based protocol:

  Collector Thread            Mutator Thread N
  ----------------            ----------------
  pthread_kill(N, SIGUSR1) -> (signal handler runs, sem_post)
  sem_wait(sem)            <-  (handler returns)

Under ThreadSanitizer, signal delivery is intercepted and serialized.
When the mutator is inside an instrumented section, TSan defers the
SIGUSR1 handler indefinitely. The Collector Thread's sem_wait then
blocks forever, hanging the whole process.

Confirmed with a gdb capture of a hung inferior:

  Thread 33 "ollector Thread":
   #5 __interceptor_sem_wait
   #6 WTF::Thread::suspend(WTF::ThreadSuspendLocker const&)
   #7-#21 [JSC GC stop-the-world path]

  Thread 3 "UnitTests":  <pending SIGUSR1>  (never delivered)

macOS JSC uses Mach thread_suspend() rather than Unix signals, which is
why the macOS TSan job has been passing in ~2.7 min the whole time.

Fix
---
Set JSC_useConcurrentGC=0 for the Ubuntu_ThreadSanitizer job only. This
removes the dedicated Collector Thread; GC runs on the mutator without
any cross-thread signaling.

Also revert the 30-min timeout bump — with the hang fixed the Linux
TSan job should finish in roughly the same time as macOS TSan (~3 min).

Verification
------------
- Default (concurrent GC on) : 9-11 hangs per 20 runs
- JSC_useConcurrentGC=0       : 0 hangs per 30 runs

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

bghgary commented Apr 17, 2026

Root cause identified — and fixed in 5c8a4a6

Short version: JSC's concurrent GC on Linux uses SIGUSR1 + sem_wait to stop the world; TSan's signal interception delays SIGUSR1 delivery indefinitely, deadlocking the Collector Thread. Setting JSC_useConcurrentGC=0 for the Ubuntu ThreadSanitizer job only removes the dedicated Collector Thread and eliminates the hang. macOS JSC uses Mach thread_suspend() instead of Unix signals, which is why the macOS TSan job has been passing all along.

Evidence

Reproduced locally in WSL Ubuntu 24.04 (same clang-18, cmake 3.28.3, libjavascriptcoregtk-4.1-dev 2.50.4 as CI). Default config: 9/20 hangs. Captured a hung inferior under gdb:

Thread 33 "ollector Thread":
 #5  __interceptor_sem_wait
 #6  WTF::Thread::suspend(WTF::ThreadSuspendLocker const&)
 #7–#21 [JSC GC stop-the-world path]

Thread 3 "UnitTests":  <pending SIGUSR1>  (signal queued for entire hang,
                                           only surfaced when gdb stopped
                                           execution — handler never ran)

The Collector was waiting on the post that Thread 3's SIGUSR1 handler was supposed to do; TSan never let the handler run.

Fix verification

| config | runs | hangs |
| --- | --- | --- |
| default (concurrent GC on) | 20 | 9 |
| JSC_useConcurrentJIT=0 | 15 | 11 |
| JSC_useJIT=0 | 15 | 4 |
| JSC_useConcurrentGC=0 | 30 | 0 |
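
As a rough sanity check on the last row, assuming runs are independent and that the default configuration's observed hang rate of 9/20 still applied, the chance of 30 clean runs occurring by luck would be:

$$
P(\text{0 hangs in 30 runs}) = \left(1 - \tfrac{9}{20}\right)^{30} = 0.55^{30} \approx 1.6 \times 10^{-8}
$$

Under that independence assumption, the clean result is very unlikely to be a sampling accident.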

Changes in 5c8a4a6

  1. Set JSC_useConcurrentGC=0 in the Ubuntu_ThreadSanitizer_clang env block only (not gcc/Sanitizers/macOS TSan)
  2. Revert the 30-min Linux-TSan timeout bump — expect this job to now finish in roughly the same time as macOS TSan (~3 min)

Marking ready for review. Auto-merge is still disabled — leaving that decision to you after checking CI.

[Created by Copilot on behalf of @bghgary]

@bghgary bghgary marked this pull request as ready for review April 17, 2026 03:43
@bghgary bghgary marked this pull request as draft April 17, 2026 03:47
bghgary and others added 2 commits April 16, 2026 20:54
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
PR BabylonJS#159 migrated the CI pipeline from Azure Pipelines to GitHub Actions
and removed .github/azure-pipelines.yml and .github/jobs/*.yml. This
merge resolves the modify/delete conflicts by accepting the deletions
and re-applies the TSan work on the new workflow structure:

- build-linux.yml: adds enable-thread-sanitizer input, wires
  ENABLE_THREAD_SANITIZER CMake flag, and sets TSAN_OPTIONS plus
  JSC_useConcurrentGC=0 env for the test run (the concurrent GC's
  SIGUSR1/sem_wait suspension deadlocks under TSan).
- build-macos.yml: adds enable-thread-sanitizer input, wires
  ENABLE_THREAD_SANITIZER CMake flag, and sets TSAN_OPTIONS.
- ci.yml: adds Ubuntu_ThreadSanitizer_clang and
  macOS_Xcode164_ThreadSanitizer jobs.

CMakeLists.txt and .github/tsan_suppressions.txt changes from this
branch are preserved unchanged.

[Created by Copilot on behalf of @bghgary]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@bghgary bghgary marked this pull request as ready for review April 17, 2026 15:46
@bghgary bghgary merged commit 4e81f0a into BabylonJS:main Apr 17, 2026
34 of 35 checks passed
@bghgary bghgary deleted the feature/thread-sanitizer branch April 17, 2026 15:55